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Abstract 

Non-random sample selection is a commonplace amongst many empirical studies 
and it appears when an output variable of interest is available only for a restricted 
non-random sub-sample of data. We introduce an extension of the generalized additive 
model which accounts for non-random sample selection by using a selection equation. 
The proposed approach allows for different distributions of the outcome variable, vari¬ 
ous dependence structures between the (outcome and selection) equations through the 
use of copulae, and nonparametric effects on the responses. Parameter estimation is 
carried out within a penalized likelihood and simultaneous equation framework. We es¬ 
tablish asymptotic theory for the proposed penalized spline estimators, which extends 
the recent theoretical results for penalized splines in generalized additive models, such 
as those by Kauermann et al. (2009) and Yoshida &; Naito (2014). The empirical 
effectiveness of the approach is demonstrated through a simulation study. 

Key Words: copula, generalized additive model, non-random sample selection, 
penalized regression spline, selection bias, simultaneous equation estimation. 


1 Introduction 

Non-random sample selection arises when an output variable of interest is available only for 
a restricted non-random sub-sample of the data. This often occurs in sociological, medical 
and economic studies where individuals systematically select themselves into (or out of) the 
sample (see, e.g., Guo & Fraser (2014), chapter 4, Lennox et al. (2012), Vella (1998), Collier 
& Mahoney (1996) and references therein). If the aim is to model an outcome of interest in 
the entire population and the link between its availability in the sub-sample and its observed 
values is through factors which can not be accounted for then any analysis based on the 
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available sub-sample only will yield erroneous model structures and biased conclusions as 
the resulting inference may not extend to the unobserved group of observations. Sample 
selection models allow us to use the entire sample whether or not observations on the output 
variable were generated. In its classical form, it consists of two equations which model the 
probability of inclusion in the sample and the outcome variable through a set of available 
predictors, and of a joint bivariate distribution linking the two equations. 

The sample selection model was first introduced by Gronau (1974), Lewis (1974) and 
Heckman (1974), in the context of estimating the number of working hours and wage rates of 
married women, some of whom did not work. Heckman (1976) formulated a unified approach 
to estimating this model as a simultaneous equation system. In the classical version, the 
error terms of the two equations are assumed to follow a bivariate normal distribution in 
which non-zero correlation indicates the presence of non-random sample selection. Heckman 
(1979) then translated the issue of sample selection into an omitted variable problem and 
proposed a simple and easy to implement estimation method known as two-step procedure. 
However, the method was proved to strongly rely on the assumption of normality and thus 
has been criticized for its lack of robustness to distributional misspecibcation and outliers 
(e.g., Paarsch, 1984; Little & Rubin, 1987; Zuehlke & Zeman, 1991; Zhelonkin, 2013). 

Various modifications and generalizations of the classical sample selection model have 
been proposed in the literature and we mention some of them. A non-parametric approach, 
which lifts the normality assumption, can be found in Das et al. (2003). Here a two-stage 
Heckman’s method is adopted and a linear regression with the Heckman’s selection correc¬ 
tion is replaced with series expansions. Non-parametric methods are also considered in Lee 
(2008) and Chen & Zhou (2010). Semi-parametric techniques in a two-step scenario, similar 
to that in Das et al. (2003), are given in Ahn & Powell (1993), Newey (1999) and Pow¬ 
ell (1994). Other semi-parametric approaches can be found in Gallant & Nychka (1987), 
Powell et al. (1989), Lee (1994a), Lee (1994b), Andrews & Schafgans (1998) and Newey 
(2009). In the Bayesian framework, Chib et al. (2009) deal with non-linear covariate effects 
using Markov chain Monte Carlo simulation techniques and simultaneous equation systems. 
Wiesenfarth & Kneib (2010) further extend this approach by introducing a Bayesian algo¬ 
rithm based on low rank penalized B-splines for non-linear and varying-coefhcient effects and 
Markov random-field priors for spatial effects. A frequentist counterpart of these Bayesian 
methods is discussed in Marra & Radice (2013b) in the context of binary outcomes and 
Marra & Radice (2013a) for the continuous outcome case. Zhelonkin et al. (2012) intro¬ 
duce a procedure for robustifying the Heckman’s two stage estimator by using M-estimators 
of Mallows’ type for both stages. Marchenko & Genton (2012) and Ding (2014) consider 
a bivariate Student-t distribution for the model’s errors as a way of tackling heavy-tailed 
data. Several authors proposed specific copula functions to model the joint distribution 
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of the selection and outcome equations; see, e.g., Prieger (2002) who employes an FGM 
bivariate copula or Lee (1983). In the context of non-random sample selection, a more gen¬ 
eral copula approach, with a focus on Archimedean copulae, can be found in Smith (2003). 
This stream of research is continued in Hasebe & Vijverberg (2012) and Schwiebert (2013). 
As emphasized by Genius & Strazzera (2008), copulae allow for the use of non-Gaussian 
distributions and has the additional benefit of making it possible to specify the marginal 
distributions independently of the dependence structure linking them. Importantly, while 
the copula approach is still fully parametric, it is typically computationally more feasible 
than non/semi-parametric approaches and it still allows the user to assess the sensitivity of 
results to different modeling assumptions. 

The aim of this paper is to introduce a generalized additive model (GAM) which accounts 
for non-random sample selection. Thus, the classical GAM is extended by introducing an 
extra equation which models the selection process. The selection and outcome equations 
are linked by a joint probability distribution which is expressed in terms of a copula. Us¬ 
ing different copulae allows us to capture different types of dependence while keeping the 
marginal distributions of the responses fixed. This approach is flexible and numerically 
tractable at the same time. Secondly, we make a step towards greater flexibility in modeling 
the relationship between predictors and responses by using a semiparametric approach in 
place of commonly used parametric formulae, thus capturing possibly complex non-linear 
relationships. We use penalized regression splines to approximate these relationships and 
employ a penalized likelihood and simultaneous equation estimation framework as presented 
in Marra & Radice (2013a), for instance. Penalized splines were introduced in O’Sullivan 
(1986) and Eilers & Marx (1996) and practical aspects of their use received much attention 
ever since; see Ruppert et al. (2003) and Wood (2006). At the same time, the asymptotic 
theory for penalized spline estimators is relatively new and mainly focuses on models with 
one continuous predictor (e.g., Hall & Opsomer, 2005; Claeskens et ah, 2009; Wang et ah, 
2011). In the GAM context, the asymptotic normality of the penalized spline estimator 
has been proved by Kauermann et ah (2009) and recently extended by Yoshida & Naito 
(2014) to the case of any fixed number of predictors. This paper discusses the asymptotic 
properties of the penalized spline estimator of the proposed model. The asymptotic rate 
of the mean squared error of the linear predictor for the regression equation of interest is 
derived, which also allows us to derive its asymptotic bias and variance as well as its ap¬ 
proximate distribution. The theoretical results are obtained for a general case of any fixed 
number of predictors and when the spline basis increases with the sample size. We show 
that even though the model structure is more complex than that of a classical GAM, the 
asymptotic properties of penalized spline estimators are still valid. Thus, the results extend 
the theoretical foundation of GAMs established so far. 
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The paper is organized as follows. Section 2 briefly describes the classical sample selection 
model as well as Heckman’s estimation procedure. Section 3 introduces a generalized sample 
selection model based on GAMs. Section 4 succinctly describes the estimation approach 
which is based on a penalized likelihood and simultaneous equation framework. In Section 
5, the asymptotic properties of the proposed penalized spline estimator are established. The 
finite-sample performance of the approach is investigated in Section 6. Section 7 discusses 
some extensions. 


2 Classical sample selection model 

The classical sample selection model introduced by Heckman (1974) and Heckman (1979) is 
defined by the system of the following equations 


v*_ 

J 1 i — 

x- 1} /3i + £u, 

V* _ 

xf' ) p 2 + £ 2i , 

Y u = 

1 (Y*i > 0), 

Y 2i = 

Y£r u , 


( 1 ) 

( 2 ) 

(3) 

(4) 


for i = 1,..., n, where Yj* and Y 2i are unobserved latent variables and only values of Y u and 
Y 2t are observed. The symbol !(•) denotes throughout an indicator function. Equation (1) is 
the so-called selection equation determining whether the observation of Y 2i is generated, and 
equation (2) is the output equation of primary interest. Row vectors x (1) = (x^,..., G 
M Dl and x (2 ) = (x^\... ,Xpl) G M D2 contain predictors’ values and are observed for the 
entire sample whereas /3i G and f3 2 G M 9 are unknown parameter vectors. The classical 
sample selection model assumes that the error terms (£u,£ 2 i ) follow the bivariate normal 
distribution 

£li 

£2 i 



for i = 1,..., n. The variance of £u is assumed to be equal to 1 for the usual identification 
reason. The normality of £u implies that the selection equation represents a probit regression 
expressed in terms of a latent variable. The assumption of bivariate normality of error terms 
has been commonly used since. It implies that the log-likelihood of model (l)-(4) can be 
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expressed as (cf. Amemiya (1985) and eq. (10.7.3) therein) 


n 

ft, a, p\Y n , • • •, Yin, Y 21 ,..., Y 2n ) = ^ {(1 - Y u ) log (l - ^(x,! 1 ^)) 

i=i 


+Y, 


1 i 


x ! 1] /3 1 + y(Y 2i - x) Zj /3 2 ) 


log <f> 


( 2 ), 




p z 


log 0 


Y 2 , - xf } /3 2 


cr 


logo- 


where $(•) and 0 (-) denote the standard normal distribution function and density, respec¬ 
tively. Under the model assumptions, maximization of the above function, referred to as 
full-information maximum likelihood, results in consistent estimators of the parameters /3i, 
/3 2 , a and p but until recently was computationally cumbersome. As a consequence Heck¬ 
man (1979) proposed a two-step procedure, also known as limited-information maximum 
likelihood method. It is based on the fact that under the assumption of normality of the 
error terms the following equality holds 


E(e 2i \Y*>0) = vpt (xf^i), 

where £(w) = is the so-called inverse Mills ratio. Thus the conditional expectation of 
the outcome variable given that its value is observed equals 


E (UiIVi > 0) = xf ft + opt (xf'ft) . 

In the first step of Heckman’s procedure, probit model (1) is fitted in order to obtain an 
estimator of (3\ and hence an estimator £* = £ ^x^/3i j of £ ^x- 1 ' ) /3i j. In the second step 
the following regression equation is estimated through the ordinary least squares method 


Y 2i = x[ 2 , /3 2 + 7 ii + £ 2i , i E {j : Y* > 0}, 

using the selected sub-sample only, where £* is an added variable and 7 = up is the cor¬ 
responding unknown parameter. After obtaining estimators /3 2 and 7 , parameter a can be 
estimated as a = ^n. 7 1 ^£ 2 j + n J 1 £i + x| 1 ' ) /3 1 ^ where n s is the size of the sub¬ 
sample, i.e. the cardinality of {j : Y,* > 0} (see, for instance, Toomet & Henningsen (2008)). 
Thus the estimator of the correlation coefficient p is defined as p = 7 /d. 

Although the inverse Mills ratio £(w) is known to be nonlinear, in practice it often turns 
out to be approximately linear for most values in the u range. This may lead to identification 
issues if both equations include the same set of predictors (Puhani, 2000, p. 57). Thus, an 
exclusion restriction, which requires the availability of at least one predictor which is related 
to selection process but that has no direct effect on the outcome, has to be used in empirical 
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applications. As mentioned in the introduction, Heckman’s model received a number of 
criticisms due to its lack of robustness to departures from the model assumptions (see, 
e.g., Paarsch, 1984; Little & Rubin, 1987; Zuehlke & Zeman, 1991; Zhelonkin, 2013, and 
references therein). 

3 Generalized additive sample selection model 

We structure the proposed sample selection model in the following way. We first assume that 
the outcome variable of interest can be described by a generalized additive model (Hastie 
& Tibshirani, 1990). Then, to take the selection process into account we extend the model 
by adding a selection equation, which is also in the form of a generalized additive model. 
Finally, the two model equations are linked using a bivariate copula. 

3.1 Random component 

Let iq and F 2 denote the cumulative distribution functions of the latent selection variable Y* 
and output variable of interest Kf, respectively. Analogically, /i and f 2 denote the density 
functions of Y* and Y 2 . We assume that Y 2 has density that belongs to an exponential 
family of distributions, i.e. it is of the form 

/ 2 (j/ 21 *72,0) = exp | V2T]2 ^ b ^ + c(y 2 , <j>) j , (5) 

for some specific functions b(-) and c(-), where r) 2 is the natural parameter and 0 is the scale 
parameter. For simplicity, we assume that 0 = 1, so that 

f 2 (V 2 \V 2 , 0) = f 2 (y 2 \m) = ex P { 2 / 2*72 - b(rj 2 ) + c(y 2 )} . (6) 

It holds E(Y 2 *) = b'{j ] 2 ) and Var(F 2 *) = b"(r] 2 ), where tf(-) and b"(-) are the first and second 
derivatives of function b(-), respectively (van der Vaart, 2000, p. 38). 

Moreover, we assume that the latent selection variable Yj* follows a normal distribution 
with mean r/i and variance equal to 1, 

fi(yilm) = exp (-0/1 - r/i) 2 ) . (7) 

The assumption Var(Y 1 *) = 1 is needed for the usual identification purpose. Then the 
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observables are defined as before, 


Yl = 1(17 >o), 
y 2 = y*y u 

implying the probit regression model for the selection variable Y\. 

We specify the dependence structure between the two variables by taking advantage of 
the Sklar’s theorem (Sklar, 1959). It states that for any two random variables there exists a 
two-place function, called copula, which represents the joint cumulative distribution function 
of the pair in a manner which makes a clear distinction between the marginal distributions 
and the form of dependence between them. Thus a copula is a function that binds together 
the margins in order to form the joint cdf of the pair (Yj*, Kf). An exhaustive introduction 
to copula theory can be found in Nelsen (2006), Joe (1997) and Schweizer (1991). As in our 
model we would like to be able to specify the marginal distributions of Yj* and Y 2 *, the use 
of copulae is a convenient approach which allows us to achieve modeling flexibility. Often 
copulae are parametrized with the so called association parameter 9 which while varying 
leads to families of copulae with different strength of dependence. Thus, we use the symbol 
Cg(-, •) throughout to denote a copula parametrized with 6. Let the function Cg be the 
copula such that the joint cdf of (Yj*, Yj*) is equal to 

F(yi,m) = Co(F 1 (y 1 ),F 2 (y 2 )). ( 8 ) 

Function Cg always exists and is unique for every (y 1; y 2 ) in the support of the joint distri¬ 
bution F. Then the joint density of (Yj*, Yj* ) takes the form: 

fi(yi)f-2(y2)- 

u — F 1 (y 1 ) 

v=F 2 (v2) 

The log-likelihood function for such defined sample selection model can be obtained by 
conditioning with respect to the value of the selection variable Yj (cf. Smith (2003), p. 
108). If Y\ = 0 then the likelihood takes the simple form of P(Yj = 0), which is equiv¬ 
alent to F\j (0). Otherwise, the likelihood can be expressed, using the multiplication rule, 
as P(Yj* > 0)/ 2 |i(y 2 |lj* > 0), where f- 2 \i denotes the conditional probability density func¬ 
tion of Yj given Yj* > 0. After substituting the conditional density / 2 |i (y 2 |Yj* > 0) by 
p(Y{>o) FY ~ ^(°^2)), we obtain the following log-likelihood 

e = (i - U) logUO) + v; log (/ 2 (y 2 ) - , 


f(yi,V2) = 


d 2 

dudv 


Cg(u,V ) 
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from which, using ( 8 ), we obtain 


£ = (1 - Yi) logFi(O) + n log (/ 2 (y 2 ) (1 - z(y 2 , 77 !, 772 ))), 


where 

d 

-(7/2,7/1,772) = ~^Ce{Fi{ 0 ),u)|,^ F2te) . 
The function z can be also expressed as 


2(3/2,771,772) = p(y* < 0) 


/2n(7/ 2 |y; < o) 

72 ( 7 / 2 ) 


Thus it is directly related to the conditional distribution of the variable Y 2 * for the unobserved 
data. 

Using ( 6 ), the log-likelihood can be written as 


£ = (1 - y) logFi(o) + yi(77 2 y 2 - 6(772) + c(y 2 ) + log (i - z (v 2 , m , V2 )). ( 9 ) 


The fact that E(Y 2 ) = 63 ( 772 ) implies 


d_ 

drj 2 


£ 


y l (y 2 - /i 2 ) + y 


d 


1 - ^(>2, 771, 772) 6)772 


^ 2 , 771 , 772 ), 


where /i 2 = E(y 2 ). Note that the first term in the expression above is equal to the score 
for a classical generalized linear model, i.e. when the sample selection does not appear and 
thus Y\ always equals 1 and z(Y 2 , 771 , 772 ) = 0. The second term corrects the score for sample 
selection bias. Interestingly, using the fact that the expected value of the score equals zero, 
we obtain that the expected value of the second term of the score equals minus the covariance 
between Yi and Y 2 , i.e. 


Cov(Yi, Y 2 ) 


—E 



4 y(yu?i, 772) \ 

1 - 2^2,771,772) J 


which provides another interesting interpretation of function z(Y 2 ,r)i,r) 2 ). 


3.1.1 Archimedean copulae 

Although any copula can be used to link the two model equations ( 6 ) and (7), the class 
of Archimedean copulae is particularly attractive for fitting the model in practice. This 
is because it provides many useful distributions that posses the advantage of analytical 
simplicity and dimension reduction because they are generated by a given function ip : 



[ 0 , 1 ] —> [ 0 , oo) of only one argument such that 


ip{C e {u,v)) = ip{u)+ip{v). 


Function (p is called a generator function and is assumed to be additive, continuous, con¬ 
vex, decreasing and meets the condition y>(l) = 0. Table 1 lists several popular bivariate 
Archimedean copulae whereas Figure 1 shows the contour plots of bivariate densities for 
normal and three Archimedean copulae (Clayton, Joe and Frank). 

For Archimedean copulae we have 


= 1 ) 



Yjfziy?) 


vUCo))' 


where F 2 = F 2 (y 2 ) and C e = C e {F l {ti),F 2 {y 2 )). Thus 


E (n*in = a = (e «*) - • 


Hence the selection bias is equal to 


1 

PM) 


(V(Fi = o)e (y 2 *) 



yf2{y) 


f\f 2 ) 

F\C e ) 



Name 

Co(u,v) 

Parameter space 

Generator ip(t) 

Clayton 

(u- e +v- e ~iy 1/0 

9 £ (0, 00 ) 

(■ t~ 9 - 1) 9- 1 

Joe 

1 - [(1 - u) e + (1 - v) e - (1 - w) e (l - w ) e ] 1/6 

9 £ (1, 00 ) 

- log (1 - (1 -t) e ) 

Frank 

-9- 1 log [l + (e~ eu - l)(e~ 0v - l)/(e~ ff - 1)] 

9 £ R\{0} 

-logoff 

Gumbel 

exp j- [(- logu) e + (- log a) 0 ] 1/6 j 

9 £ [1, 00 ) 

(- log!)" 

AMH 

uv/ [1 — 0(1 — u)(l — f)] 

d£ [-1,1] 

log '-?-•> 


Table 1: Families of bivariate Archimedean copulae, with corresponding parameter range of 
the association parameter 6 and generator ip(t). 


3.2 Systematic component 

Terms rj\ and r] 2 are assumed to depend on sets of predictors x (1 ^ and x (2 \ respectively, so 
that rji = ? 7 i(x (1 ^) and y 2 = 7/ 2 (x( 2 )), where x (1) = (x^,..., x^j) and x^ = (x^ 2 \ ..., x^). 
Moreover, we assume the following additive form of the functions 771 (x^ 1 )) and 772 (x^ 2 )) 


?7i(x (1) ) 
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Normal 


Clayton 




Frank 




Figure 1: Contour plots of bivariate densities for different copnlae. The margins follow a 
standard normal distribution and a gamma distribution with shape 2 and rate 1. The value 
of dependence parameter 6 is set to provide Kendall’s r = 0.5 for all four distributions. 


and 


77 2 (x (2) ) 



Functions rft (x^) and 772 (x (2 ^) are unknown. We use spline basis functions to represent 
the unknown smooth functions. Specifically, we consider B-splines (De Boor (2001)), which 
have attractive numerical and theoretical properties. 


B-spline approximation 

On interval [0, 1 ] define sequence of knots 0 = k 0 < < ... < kk = 1 and another 2 p 

knots kk = kk+i = • • • = kr+p and — ■ ■ ■ — = «o- Then B-spline basis 

functions of degree p are defined recursively as 


B 


k,p 


X = 


X ~ K k -l 
ftk+p—1 ftk— 1 


^k,p— 1 (*^) 


ftk+p 


X 


ftk-\-p ftk 


-®fc+l,p—1 (*^) 
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for k = — p + 1,..., K, with 


1, K k -i <X<K k 

0 otherwise. 


B k , o(x) = 

This gives K + p basis functions B_ p+ 1 )P (x), ..., B Kp (x). 

We approximate r/j 1 ^ (x) by a linear combination of basis functions: 

I< 

Y a K ,jB kjP (x) for j = 1 ,..., D u 

k=~p +1 

and, similarly, rfp(x) by 

K 

Y pK,jB k>p (x) for j = 1 ,..., D 2 . 

k=—p +1 

In the remaining of the paper we omit the subscript p of basis functions B k p and we use 
symbols B~ p+ i(x), ..., Bk(x) to denote p-th B-spline basis functions. We define vectors 
olj G W +K and /3j G W +R as 

^j (oj—p+ij) • • • i &K,j ) for j 1 ,..., D\ 

and 

= (/3-p+ij, • • •, fe) T for j = 1 ,.. •, L > 2 

and vectors 

n=(a[,...,4 i ) T 6l°^ 

and 

Moreover, we use the following notation throughout. For a given n G N, assume that 
(Yi i, Y 2 i)f =l are independent random variables related to predictors’ values x- 1 ' = (x^, ..., 
and x| 2) = (x^, ..., x^J for i — 1,..., n such that Y u = 1 (Yj* > 0) and F 2 * = Y 2 *Y U , 
where Yj* has density (7) with rj\ = p 1 (x,[ 1 ' ) ) and Y 2 * is distributed according to ( 6 ) with 
V2 = h2(xf } ). 

Let Fu, F- 2 i denote the distribution functions of Yj* and Y 2 * and let Fj(-, •) be the joint 
cdf of the pair (Lj*, Y 2 *). Moreover, for j = 1,..., D x let : n x (p + K ) be a matrix 
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defined through 


xP = 


B-p-\-k \%aa 


( 1 )> 


J* 


I k=l,...,K,i=l,...,n 

and for j = 1,..., D 2 let X ■ : n x (p + K ) be a matrix defined as 


X (2) = 


B_ p+k (x\ ; 


( 2 )', 


J* 


J k=l,...,K,i=l,...,n 

Then let X^ 1 ) : n x D\{p + K) and X (2 ) : n x D 2 (p + K) equal 

X W= I Af 


and 


X (2) 


Af A' 


( 2 ) 

£>2 


With the B-spline approximation, we postulate the parametric model 


Y ii ~ N(X^a, 1) 

Yi ~ h ife|xf>) = exp (jfaXf/3 - 6(xf>/S>) + cfe)) 
where X- 1 "* and X ^" 1 denote the i-th rows of the matrices X^ 1 ) and X* 2 ), respectively. 


4 Some estimation details 

In order to estimate parameters ol, (3 and 9 , we employ a penalized likelihood approach 
which is common for regression spline models. Based on (9) and (10), the log-likelihood 
given a random sample (Yu, Y 2 i)\? =1 equals 

n 

£{ a ,(3,6) = Y / {l-Y li )log F lt {0)+Y li {X^(3Y 2l -b{xf(3)+c{Y 2l )+log{l-z{Y 2l ,cx,f3,9)), 

i —1 

at 

where z(Y 2 i,a, 0,0) = /f 'efAi,fO). ef The normality of if implies that F b ( 0 ) = 

<h(—? 7 ij), where denotes the standard normal distribution function. The penalized log- 
likelihood equals 

Di a D2 

£ p {a, (3, 9) = £(a, (3,9) - - ^ Yl X ? ] Pj A m A mPj = 

i = 1 i =1 

= £{a, (3,9) - VqL 1 ) (A (1) )q - ^(3 T Q^{\ (2) )(3, ( 12 ) 
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where A m : (K + p — m) x (A' + p) is the m-th difference matrix (Marx and Eilers, 1998), 
Ql !>(A (1) ) = diagfA^A^A™,,,,, Ag A^A m ), qS>(A< 2 >) = diag(A< 2) A£A„„.... Ag>A^A m ), 

and A' 1 * — (Ag,...,Ag) and A^ 2) — I'A 2 ''. A'gj are smoothing parameters controlling 

the trade-off between smoothness and fitness. 

For a given A = (A^, A^), we seek to maximize (12). In practice, an iterative procedure 
based on a trust region approach can be used to achieve this. This method is generally more 
stable and faster than its line-search counterparts (such as Newton-Raphson), particularly 
for functions that are, for example, non-concave and/or exhibit regions that are close to flat; 
see Nocedal & Wright (2006, Chapter 4) for full details. Such functions can occur relatively 
frequently in bivariate models, often leading to convergence failures (Andrews, 1999; Butler, 
1996; Chiburis et ah, 2012). Some details related to the process of iterative maximization 
of the penalized likelihood via a trust region algorithm are presented in subsection A.l of 
Appendix A. 

Data-driven and automatic smoothing parameter estimation is pivotal for practical mod¬ 
eling, especially when the data are partly censored as in our case, and each model equation 
contains more than one smooth component. An automatic approach allows us to determine 
the shape of the smooth functions from the data, hence avoiding arbitrary decisions by 
the researcher as to the relevant functional form for continuous variables. For single equa¬ 
tion spline models, there are a number of methods for automatically estimating smoothing 
parameters within a penalized likelihood framework; see Ruppert et al. (2003) and Wood 
(2006) for excellent detailed overviews. In our context, we propose to use the smoothing 
approach based on Un-Biased Risk Estimator as detailed in subsection A.2 of Appendix A. 


5 Asymptotic theory 

In this section, the asymptotic consistency of the linear predictor f/ 2 based on the penalized 
maximum likelihood estimators for the regression equation of interest is shown. In particular, 
the asymptotic rate of the mean squared error of i) 2 is derived. The theoretical considera¬ 
tions also allows to derive its asymptotic bias and variance and its approximate distribution. 


First, we introduce the notation that we use throughout. Denote S = ( ct,/3,9) and let 


G n {S) = (G“(S),G%(S),G°(S)), 


where 


GOT) 


di 

da 



dC d? 

dai ’ ’ ’ ’ ’ da Dl 


e r d ^ k +p +1 \ 
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G e n (S ) = ^(*) G E. 


Analogically, let 


Gn, r (<S) = (G^(i), 0^(4), 0^(4)) = ' 


Proceeding to the hessian matrix, let us denote 

r) 2 P B 2 f B 2 f 

= WW {6> ’ H ^ S) = and30 ° 1L 


Moreover, let 


Analogically, let 


" Hg(S) H^(S) 

H n {8)= Hg»(S) Hg(8) Hg- e (8) 

. *W) *W) 


«„“„(«) = = *W) - QSW). 

r) 2 Q 

K P W = - QS’Of), 

and so on, and H n ^(8) = d d S g£ T (<?)■ 

Let 

W) = E [ff“(<5)], jf(«) = E [/£(«)] , F“-' J W=E[W“.' S (5),] 

C“ p («) = C“(«) - Q«(AW), F&(i) = Fg (i) - Qg)(Af). 

Let <5° denote a parameter vector that satishes the condition EG„(<5°) = 0. Then <5° is 
maximizer of the expected unpenalized log-likelihood and provides the best approximation 
of ( 771 , 772 ) in terms of Kullback-Leibler measure as it minimizes Kullback-Leibler distance. 
We adopt the following assumptions: 

Al. All partial derivatives up to the order 3 of copula function Cg(u,v ) w.r.t u, v and 6 
exist and are bounded. 

A2. The function z(y 2 ,cx, (3,6) is bounded away from 1. 


A3. max/ =1)2; j=i,...,D; (A - ) = 0 (n 7 ) where 7 < 
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A4. The explanatory variables x^- 1 and x^ are distributed on unit cubes [0, l] 151 and 
[0, 1} D2 , respectively. 

A5. The knots of the B-spline basis are equidistantly located so that Kk — Hk-i = K~ ] for 
k — 1,, K n and the dimension of the spline basis satisfies K n = 0(n 1 ^ 2p+2 '' 1 ). 

A6. K n is such that ( Di + D 2 )(K n + p) < n. 

Theorem. Under assumptions (Al)-(A6) the estimate fj(x) has asymptotic expansion 

f,(x)- V°(x) ~ X(x)F p “ 1 (5 0 )Gp(5 0 ), 

which implies 

MSE (fj(x)) = E(rj(x) - rf(x)f = 0(n- (2p+2)/(2p+3) ). 


Before we proceed to the proof of the theorem we derive analytic formulae for the gradient 
and hessian of the penalized log-likelihood as their properties will play a central role in the 
asymptotic considerations. 


Gradient of the penalized likelihood 

Straightforward calculations yield 


G“ P (S) = 


n 

£ 

2—1 


(l-y lj )(*(-X i lI) o: 


-'"Yu™!- 

1 - Zi 




(1) a)xS 1 } -a T gW( A«), 


where </?(•) is the density function of the standard normal distribution and 
VCj = and z t = z(Y 2i ,a, 0,0). Moreover, 


OS) = £ Yu 


1=1 


Y x - 6'(Xf>/3) 


2! 


1 — Zi 


x! 2, -/3 r Q'?(Ai 2) ), 


where 


and 


0 Zi 

dm 


dv 2 


Cg(F u (0),v)\ v=F2i (Y 2i ) 


dF 2i 

dm ’ 



£Ui 


1 dzj 

1 — Zi 06' 


where §f = ^,C g {F u (0),v)\ v=F2i( Y 2i ). Thus we can write the gradients in a matrix form 


G“„(<5) = a T Xl') - a^WfAW), 
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where a = (op, ..., a n ) with a t = - |(1 - Y u ) ($(-Xf } a)) - >ii^| </?(-Xf } a) for 

i — 1 ,... , n. Analogically, 


G£js) = b T X (2 » - /3 T eg>(Af), 


where b = (b lt ..., b„ j with b r = >i, Y 2i - t/(X\ z ';3) - gL. 


for i = 1 ,, n. Moreover, 


G e n JS) = c T l, 

where c = (ci,..., c n ) with q = -Viiyjq-fl and 1 = ( 1 ,..., 1 ) T e 


Hessian 

Straightforward but tedious calculations yield the following lemma. 

Lemma 1. The component matrices of H n (5 ) take the following forms: 

H%{S) = H%(5) = (X^) T W 2 X (2 \ H^(S) = (X^) T W 3 X^ 2 \ 

H e n ’ a {5 ) = 1 t W / 4 X (1) , H e /{8) = 1 t 1L5X (2) , H e /(S) = 1 T W 6 1 , 

where W,- = diag(w^,..., k;^) for j = 1 ,..., 6 and 

^ (1) = (1 - yii)(F li (0))- 1 (F li (0)-V(-Xi 1} a) - 1)^(-Xf ) a)+ (13) 


+Y U 


1 d 3 Ci ( VC' 


1 — z* du 2 dv V 1 — Zj 




VCj 

1 — Zj 


vt-xPa), 


= Ui 




(1 - Zif 


W (1 _ *0 + ( 4 ) 2 ) 


^, (3) = y u 


i «9 3 a 


+ 


|_1 — Zidu 2 dv (1 — Zj) 2 


rVQ 


1 // 

where z- = 77—, 
OT]2 


<9 VQ 


( 4 ) 

w- = 

4 <90 1 - ^ 


‘/•(-xfa), 


(5) V d Z i 
ir) - -Yu 


wf = -Yi — 


<90 1 — Zj ’ 
<9 / 1 <9zj 


<90 V 1 — Zj <90 


(14) 

(15) 

(16) 

(17) 

(18) 
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Corollary 1. The Hessian H n (S) can be written in the following matrix form 



' (x^) T 0 0 


" Wi w 3 w 4 ' 


1 

o 

o 

H n {5) = 

o (x^) T 0 


w 3 w 2 w 5 


0 X( 2 ) 0 


i 

h 

tH 

o 

o 


_ W 4 W 5 W 6 _ 


-1 

rH 

o 

o 


where the matrices W 3 = diag(iup'\ ..., wiP) for j = 1,..., 6 are given by the expressions 
(13)-(18). 


In order to proof Theorem, we use several lemmas, stated below. 

Lemma 2. Under assumptions (Al)-(A6), elements of the matrix F ra (<5°) are of order 
O (j£~) and elements of the matrix F np (d°) are of order O fjr- + max; =1)2;: ,=i ,^\. 


Lemma 3. Under assumptions (Al)-(A6), elements of the matrix (F np (S 0 )) 1 are of order 

OpUT 


Lemma 4. Elements of the matrix H n (S °) — F n (d 0 ) are of order Op 


The proofs of Lemmas 2-4 are presented in Appendix B. 


Proof of Theorem. First, we expand G p (-) around <5°. Let M n = (D\ + D 2 )(K n +p). For 
j = 1,, D\ + D 2 + 1 


0 



d5F ’ 


M n 


i=i 


d% 

08 J dS l 


(. 8°){6 l 


5°' l )+ 


M n M, 


n lvl n 

EE<* 


fO,p 


d 3 L 


-X5°)(d r -5°’ r ) + o(R n ), 


1=1 r= 1 


J dSWS l dS r 

where R„ = ~ <V’)- 
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Series inversion yields 


Wl n 


dL 


-t Wl n Wl n 7 r\ t 

1 x ^ ^ , aL, olr 


»- 4 = - X %#(-*“) + 5 X X (*°> 5 £(«°) + ■ ■ ■ 

/=! 1=1 r= 1 


where dji is (j, /)-element of the inverse of matrix H p (6°) and 

7 _ s~^M n \pM n d% 

u jlr Z^s=l Z^t=1 Z^u= 1 u 'js u 'ltu ru ggsggtQgu \ u )■ 

Then 

(- HAS 0 ))- 1 = ( F P (S °) + (tfp(<5°) - -F p (<5 0 ))) -1 = (Fp(d°) + S(<5 0 )) -1 = 

= FpiS 0 )- 1 - Fp(5°) _1 S'(5 0 )Fp(d°) _1 + Fp(d 0 )~ 1 S'(d 0 )Fp(d 0 ) _1 S'(5°)Fp(5 0 )” 1 + ... = 
= Fp(<5 0 ) -1 (/ - 1 S , ((5 0 )F P (<5 0 )- 1 + SiS^FpiS^SiS^FpiS 0 )- 1 + ...) = 

= Fp(d 0 )- 1 (/ + o( v / iW^)). 

Moreover, 

<9 3 / n <97/;. 

P (^°) = E M 7 B -p+t( x i) B -P+u(xi) = 0 P (n/K n ), 


dd s dd t dS uX ' ^ dS 

2=1 


as YW=i B -p+t( x i) B -P+u( x i) = 0{n/K n ) and |p- is bounded. This yields b jtr = 0(K 2 /n 2 ) 
and in consequence 


M„ 07 07 

EEV§(<S°)§(<S“)=Op(AV' 


/=! r=l 


n) 


Thus 




1=1 


dL 


& ~ 6°’ j = E ( -^(*°) ) + °W K n! 


in) = 


Win 


1=1 


dL 


+ o(l)) + op(K n /n), 


m 


where fji is (j, /)-element of the matrix F p (S°) 1 . Hence we can write the above equation 
a matrix form 

<5 - <5° = -Fp( < 5°)- 1 G'p(5°)(l + o(l)) + op(K n /n). (19) 

Thus assumptions (A3) and (A5) together with the fact that E(G(<5 0 )) = 0 yield 

E 11 fj(x) - fj°(x) II 2 = E(X(x)<5 - X(x)<5°)(X(x)<5 - X(x)5°) T = 0(K ~ (p+1) + X jn K^ m /n). 


□ 


Hence E||r)(x) — 'f)°(a;) 11 2 = 0(n ( 2 p+ 2 )/( 2 p+ 3 )). 
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Remark Expansion (19) yields the approximate variance of rj(x) in the form 


Var (fj(x)) « X(x)F p (S°)- 1 F(S°)F p (S°)- 1 X(x) T , 

which can be shown to be of order 0(K n /n ) as n —y oo. Moreover, using the Central Limit 
Theorem we obtain the approximate distribution of f)(x) as 

fj(x) ~ N(rj°(x), Vax(rj(x))). 


6 Simulations 

In this section, the properties of the proposed generalized additive sample selection model 
are investigated empirically. Specifically, we first assess the effectiveness of the proposed 
approach at finite sample sizes and then provide some evidence of the potential inaccuracy 
arising from modeling transformed outcomes. 

6.1 Empirical consistency 

Data were generated as follows. For the latent selection variable, it was assumed that 

T]* = Oio + Si(xi) + 52(^2) + CH4X4 + 0:5X5 + £j, 

with a 0 = 0.7, si(x) = — 0.2sin(|^x), s 2 (x) = —0.0004(x + O.Olx 1 / 3 ), 0:4 = 0.6, 05 = —0.4 
and £{ ~ IV(0,1), whereas the outcome variable Y , 2 * was assumed to pertain to a gamma 
distribution with shape parameter k = 2 and expected value /.y = E(K,*) such that 

log(Ab) = /3 0 + s 3 (xi) + S4(x 3 ) + ( 34 X 4 + (3 5 x 5 , 

with Bo = —1.5, s 3 (x) = 0.0006exp(0.lx), S 4 (x) = 0.03x, / 3 4 = —1 and = 0.75. The two 
equations were linked using a Gumbel copula with association parameter 6 = 3. The co¬ 
variates were generated as x\ ~ Uniform(16, 66 ), X 2 ~ Uniform(10, 70), x 3 ~ Uniform(0, 20) 
and X4 and x 3 were binary variables taking values 0 and 1 with equal probabilities. 

To investigate the asymptotic behavior of the estimators, data sets of increasing sizes 
were considered: n = 500, 1000, 1500, 2000, 2500, 3000. For each generated data set a gen¬ 
eralized additive sample selection model assuming a gamma distribution for the outcome 
with log link function was fitted using the SemiParSampleSel function from the package 
SemiParSampleSel (Wojtys et ah, 2015) in R (R Core Team, 2015). Additionally, a univari¬ 
ate generalized additive model based only on the observed outcomes was fitted for compari- 
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n 

500 

1000 

1500 

2000 

2500 

3000 

GASSM 

0.0778 

0.0538 

SD(/3 4 ) 

0.0413 

0.0334 

0.0335 

0.0287 

GAM 

0.0663 

0.0492 

0.0370 

0.0298 

0.0293 

0.0263 

GASSM 

0.0720 

0.0469 

sd(ao 

0.0392 

0.0332 

0.0296 

0.0288 

GAM 

0.0650 

0.0395 

0.0344 

0.0311 

0.0298 

0.0258 


Table 2: Empirical standard deviations of the GASSM and GAM estimators of /? 4 = —1 
and /3 5 = 0.75. 

son using the gam function from mgcv (Wood, 2006). The two above models will be referred 
to as GASSM and GAM, respectively. The number of Monte Carlo repetitions was 200. 

The empirical expected values of the GASSM and GAM estimators for /J 4 and /3 5 and 
the square root of their empirical mean squared errors are plotted as a function of sample 
size and shown in Figure 2. Moreover, the mean integrated squared errors (MISE) of the 
estimators for S 3 and s 4 are shown in Figure 3. In all plots, solid lines correspond to the 
GASSM results and dotted lines to those obtained using GAM. The two top plots in Figure 
2 show that GAM estimators are biased, while the GASSM estimators have empirical mean 
values practically identical to the true values. It is worth noting that the GAM estimators 
have smaller standard deviations than those of GASSM (see Table 2). This is because the 
latter acknowledge the uncertainty due the selection mechanism. Importantly, the mean 
squared errors of the GASSM estimators remain uniformly smaller than those for GAM as 
shown in the two bottom plots of Figure 2 . 

Empirical consistency of the estimators for S 3 and s 4 can be studied by looking at the 
results in Figure 3. Interestingly, while the GAM estimator for S 3 has a MISE which is 
considerably larger than that of the GASSM, both estimators for s 4 appear to be equally 
good; a possible interpretation is that function s 4 is linear and the additional parameters 
associated with its basis functions available for estimation may compensate for selection 
bias. 

6.2 Comparison to logged model 

In real world applications, the analysis of data with positive outcomes having a highly skewed 
distribution is often performed using log-transformed outcomes. Then a model assuming a 
normal or t distribution is fitted (see, e.g., the example in Marchenko & Genton (2012)). 
This section shows the potential inaccuracy of this practice. 
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500 1000 1500 2000 2500 3000 500 1000 1500 2000 2500 3000 


Figure 2: Empirical expected values of the GASSM and GAM estimators for /5 4 = —1 
and /?5 = 0.75 (top plots) and square root of their mean squared errors (bottom plots). 
GASSM denotes the generalized additive sample selection model while GAM denotes the 
classic univariate generalized additive model. Solid lines refer to GASSM results whereas 
dotted lines to those produced using GAM. 
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Figure 3: Empirical mean integrated squared errors of the GASSM and GAM estimators 
for S 3 and S 4 . 

For the latent selection variable, it was assumed that 

Y*i = CL 0 + OL\X\ + OL 2 X 2 + OL 3 X 3 + 

with a 0 = 0.58, aq = 2.5, a 2 = —1, aq = 0.8 and £ l ~ N( 0,1), whereas the outcome variable 
y 2 * was assumed to pertain to gamma distribution with shape parameter k = 2 and expected 
value /ij = E(y 2 *) such that 

log (/Xi) = A) + ftaq + / 3 2 ^ 2 , 

with /3 0 = —0.68, (3i = —1.5 and /3 2 = 0.5. Three different patterns of dependence between 
Y* and y 2 * specihed by the normal, Frank and Clayton copulae were considered. For each 
copula, three values of association parameter 9 , corresponding to the values of Kendall’s r 
equal to 0.1, 0.5 and 0.7, were used. 

To generate covariates aq, x 2 and x 3 , random numbers (zi, z 2 , z 3 ) pertaining to a trivari¬ 
ate normal distribution were used. That is, we first generated 



( 

l l 

O O 

l_l 


1 

0.5 

0.5 ' 

\ 

(z 1 ,z 2 ,z 3 ) ~ N 


5 

0.5 

1 

0.5 





_ 0.5 

0.5 

1 

/ 


and then aq, x 2 and x 3 were obtained as aq = 1(<F -I (^i) > 0.5), x 2 = < h” 1 (^ 2 ) and x 3 = 
$ _1 (^ 3 ). Thus, predictor x\ was a binary variable, whereas x\ and X 2 were uniformly 
distributed on (0,1), with correlation approximately equal to 0.5. 

For each generated data set, two models were fitted: one assuming a normal distribution 
for the logarithm of the outcome and the other assuming a gamma distribution for the 
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do 


di 


$2 


f 


Test 



Bias (%) 

RMSE 

Bias (%) 

RMSE 

Bias (%) 

RMSE 

Bias (%) 

RMSE 

error 

Normal Copula 

H 

O 

G 

-5.7 

0.129 

3.8 

0.15 

5.3 

0.105 

-83.5 

0.249 

0.329 

II 

L 

L 

23.7 

0.253 

9.7 

0.272 

11.9 

0.141 

-227.4 

0.426 

0.339 

LO 

O 

G 

-0.4 

0.069 

0.7 

0.077 

1.9 

0.086 

-1 

0.13 

0.321 

II 

L 

L 

30.9 

0.236 

5.3 

0.153 

5.9 

0.111 

-19.3 

0.243 

0.334 

h- 

O 

G 

-0.1 

0.06 

0.5 

0.066 

1.9 

0.084 

0.3 

0.104 

0.321 

II 

L 

32.3 

0.229 

4.2 

0.096 

4.1 

0.098 

-7.1 

0.121 

0.332 

Frank Copula 

t-H 

O 

G 

-6.1 

0.13 

3.2 

0.148 

0.5 

0.094 

-34 

0.245 

0.328 

II 

L 

L 

39.7 

0.32 

-0.4 

0.204 

-3.5 

0.12 

18.9 

0.318 

0.347 

LO 

O 

G 

-2.7 

0.085 

1.4 

0.095 

-0.3 

0.084 

-6.8 

0.18 

0.324 

II 

L 

L 

33 

0.249 

3.2 

0.132 

-0.6 

0.102 

-5.1 

0.205 

0.338 

O' 

h- 

O 

G 

-1.6 

0.07 

0.8 

0.078 

-0.3 

0.084 

-2.5 

0.149 

0.324 

II 

k. 

L 

30.8 

0.225 

4.2 

0.112 

0.1 

0.098 

-5.9 

0.156 

0.335 






Clayton 

Copula 





O 

G 

1.3 

0.058 

-0.4 

0.064 

1.9 

0.094 

5.5 

0.071 

0.322 

II 

L 

L 

32.3 

0.227 

3.9 

0.086 

4.9 

0.107 

-76.5 

0.085 

0.332 

O' 

LO 

O 

G 

0.6 

0.047 

0 

0.054 

1.4 

0.086 

1.7 

0.06 

0.321 

II 

L 

L 

32.4 

0.229 

4.1 

0.091 

4.7 

0.101 

-7.9 

0.093 

0.332 

O 

G 

0.4 

0.045 

0.1 

0.05 

1.1 

0.085 

1.4 

0.052 

0.32 

II 

-b=— 

L 

33.5 

0.233 

3.6 

0.077 

4.1 

0.097 

-1.8 

0.053 

0.332 


Table 3: Comparison of performance between the gamma sample selection model (G) and 
the normal sample selection model in which the logarithm of the outcome variable is used 
(L). 


outcome. The sample size was n = 1000 and the number of Monte Carlo repetitions 300. 
For both fitted models, the relative bias and root of mean square error of estimators /3 0 , Pi, 
fa and Kendall’s t (related to 9) were calculated and are reported in Table 3, which also 
shows test errors of the fitted models. 

In most cases considered the gamma sample selection model outperforms the normal one 
(which employes a log-transformed outcome) in terms of bias, mean squared error and test 
error. This not only shows that the proposed model is flexible enough to accommodate non- 
Gaussian distributions, but also that using a transformed outcome can lead to unreliable 
empirical results. 
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7 Discussion 


We have introduced an extension of the generalized additive model which accounts for non- 
random sample selection. The proposed approach is flexible in that it allows for different dis¬ 
tributions of the outcome variable, several dependence structures between the outcome and 
selection equations, and non-parametric effects on the responses. Parameter estimation with 
integrated automatic multiple smoothing parameter selection is achieved within a penalized 
likelihood and simultaneous equation framework. We have established the asymptotic the¬ 
ory for the proposed penalized spline estimators, and illustrated the empirical effectiveness 
of the approach through a simulation study. A few points are noteworthy. 

• The generalized sample selection model has been formulated using penalized B-splines. 
This allows for simple handling of the model’s theoretical properties. However, in 
practice different smoothers can be used, for example truncated polynomials (which 
yield an equivalent approach as detailed in Kauermann et al. (2009)) or thin plate 
regression splines (Wood, 2006) . 

• The estimation procedure discussed in Section 4 has been implemented in the freely 
available R package SemiParSarapleSel. Currently, the outcome can be modeled using 
the normal, gamma and a number of discrete distributions. The copulae available 
are: normal, Clayton, Joe, Frank, Gumbel, AMH, FGM and their rotated versions. 
Given the modular structure of the estimation algorithm, other copulae and marginal 
distributions can be incorporated in SemiParSarapleSel with little programming work. 

• For simplicity of treatment and notation, the generalized additive sample selection 
model has been defined using as few parameters as possible. However, many model 
structures are allowed within the proposed framework. For example, the association 
parameter 6 can be made dependent on predictors and hence enter the likelihood func¬ 
tion as a transformed linear predictor instead of a scalar. Similarly, even though the 
scale parameter (f> has been set to 1 for simplicity of derivations, additional parameters 
related to the specific distribution employed can be estimated. In both cases, all the 
theoretical derivations presented in the paper still hold. 

• Assumption A4 in Section 5 allows the sequence smoothing parameters X n to grow 
as the sample size increases. This condition is rather week as, in fact, the sequence 
A n based on the mean squared error criterion described in Section A.2 is bounded in 
probability (cf e.g., Kauermann, 2005). Thus the theoretical properties of the penalized 
estimator derived in Section 5 hold even if the smoothing parameters are estimated, 
and not deterministic. 
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An interesting direction of future research will be to compare the small sample per¬ 
formances of the proposed estimator and some of the non-parametric, semiparametric and 
Bayesian methods mentioned in the introduction. Moreover, for many copulae a specific 
value of the association parameter 9 yields a product distribution which indicates lack of 
non-random sample selection. Thus, the important issue of testing hypotheses regarding 
parameter 9 will be addressed in future work. 
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A Algorithmic details 

A.l Trust region algorithm 

Recall that S = (a, (3,9) and define the penalized gradient and Hessian at iteration a as 
Gp^ = and H^ — S^. Each iteration of the trust region algorithm solves 

the problem 

mm £ p {6 [a] ) d = - |4(5 [a] ) + p T Gj“l + ^p T ij[ al p| such that ||p|| < r [a \ 

^b+h _ ar g m j n $ p (§\- a 1) + <5^, 

p 

where || • || denotes the Euclidean norm, and r^ is the radius of the trust region. At 
each iteration of the algorithm, is minimized subject to the constraint that the 

solution falls within a trust region with radius r^. The proposed solution is then accepted 
or rejected and the trust region expanded or shrunken based on the ratio between the 
improvement in the objective function when going from <5^ to and that predicted by 

the quadratic approximation. See Geyer (2013) for the exact details (e.g., numerical stability 
and termination criteria) of the implementation used here. It is important to stress that near 
the solution the trust region method typically behaves as a classic unconstrained algorithm 
(Geyer, 2013; Nocedal & Wright, 2006). Starting values for the coefficients in a and [3 are 
obtained by fitting the selection and outcome equations separately. The initial parameter 
of 9 is set to zero as there is not typically good a priori information about the direction 
and strength of the association between the selection and outcome equations, conditional 
on covariates. 
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A.2 Multiple smoothing parameter estimation 

Let us use the fact that near the solution the trust region algorithm usually behaves as a 
classic Newton or Fisher Scoring method, and assume that <5[ a+1 l is a new updated guess 
for the parameter vector which maximizes i v . If <5^“ +1 is to be ‘correct’, then the penalized 
gradient evaluated at those parameter values would be 0, i.e. Gp a+1 ' = 0. Applying a first 
order Taylor expansion to Gp a+1 ^ about <5^ yields 0 = Gj, a+1 ' ~ G$ + (A a+1 ] — <5A) Hp'\ 
from which we find the solution at iteration a + 1. After some manipulation, this can be 
expressed as 

<5 [a+1] = (l [a] + S A ) 1 \/l [a] z [a \ 

where X^ is —H^ (or, alternatively, — E (fA a l)), and zA = VX^d^ + e^, with = 
\/W ] G^. From standard likelihood theory, e ~ A/" (0,1) and z ~ A/"(/x z ,I), where I is 
an identity matrix, /r z = \/X8 °, and <5° is the true parameter vector. The predicted value 
vector for z is fi z = \[X8 = A^z, where = \[X (X + SjJ 1 \[X. Since our goal is to 
select the smoothing parameters in as parsimonious manner as possible so that the smooth 
terms’ complexity which is not supported by the data is suppressed, A is estimated so that 
fi z is as close as possible to /r z . This can be achieved using 

E (||/x z - /i z || 2 ) = E (11z - A a z - e|| 2 ) 

= E (11z - A a z|| 2 ) + E (—e T e - 2e T /r z + 2e T A A /r z + 2e T A A e) 

= E (||z - A a z|| 2 ) -n + 2tr(A A ), 

where n = 3 n and tr(A A ) is the number of effective degrees of freedom of the penalized 
model. Lienee, the smoothing parameter vector is estimated by minimizing an estimate of 
the expectation above, that is 

V(A) = ||z - A a z|| 2 - n + 2tr(A A ), (20) 

which is equivalent to the expression of the Un-Biased Risk Estimator given in Wood (2006, 
Chapter 4). This is also equivalent to the Akaike information criterion after dropping the 
irrelevant constant; the first term on the right hand side of (20) is a quadratic approximation 
to —21(8) to within an additive constant. In practice, given the problem becomes 

A [a+1] = arg min V(A) = ||z [a+1] - Af +1] z [a+1] || 2 -h + 2tr(A^ +1] ), (21) 

A 

which is solved using the automatic stable and efficient computational routine by Wood 
(2004). 
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B Proofs of Lemmas 


Proof of Lemma 2. We have 
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From Lemma 1 of Yoshida & Naito (2012) we obtain that elements of matrices ^xj 1 ^ j 
are of order O for j = 1,..., Di, and elements of matrices ^Xj 1 ^ x\ 1} are of order 

O (^T j for j ^ l, j,l = 1,..., D\. The same boundaries hold for the matrices xj 2 \ 
j = 1,...,D 2 . 

Thus elements of matrices (X (1 ^) 2 X (1 ) and (X^) 2 X^ 2 ^ are of order O a straight¬ 

forward way, the result also extends to the matrix (X (1) ) 2 X (2 d 
Now we consider the order of the diagonal elements w^\ ..., wf\ It holds 

<E s o\Y 2l \ + \b'([3 0 )\<2E so \Y 2l \. (22) 
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dr) 


(V2 


2i 


cy 2 


(v - b\f3 0 ))f 2 i(v)dv 


and 

f )2 1 7 . ry 2 

= (1 - b"(r) 2i ) + (v - b'(r) 2 i)) 2 ) f 2 i(v)dv. (23) 

CfT l2i J- oo 

Thus \ f ^r{y 2 )\ < 1 +2Var <5 o(Y 2 *)- This combined with (22) and assumptions (Al) and (A2) 
yields E(iu^) = 0(1) for j = 1,..., 6. 

Moreover, by the properties of B-splinc basis the (i,/)th components of ^xj 1 ^ X^, for 

j,k = 1,..., Di, and ^xj 2) j xjf\ for j, k = 1,..., D 2 , equal 0 if \i — l\ > p. Hence the 
matrices (Xd)) T WiX^, (x( 2 )) T W 2 X( 2 ) are band matrices and the assertion follows. 

□ 


Proof of Lemma 3. We use induction w.r.t. the number of variables. Let M n = Di+D 2 +1 
and matrix Um„ be defined as 


U Mn = F n ,p(d°) = a 


U Mn -1 R T 
R A 


The result of Horn & Johnson (1985) yields 


UmI = 


U M l -1 + U^IFV-'RU- 1 


M n -1 


- UmUr t v - 1 
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-V~ l RU 


Mr,- 1 


U\ 


Mr ,-1 
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where V 1 = A — RUjJ _, R 1 . Then assertion can be proven similarly to Kauermann et ah 
(2009) by using the fact that matrices (X«) T WiXW (X^) T and (X^) T 

are band matrices and the properties of the inverse of band matrices listed in Demko (1977). 

□ 

Proof of Lemma 4 (sketch). 

H n (S°) - F n (S°) = X T (W - E(W))X. 

It holds Wi — E (wf) = Op(n -1 / 2 ) as every Wi is a sum of independent and bounded random 
variables. Moreover, 

1 " 

- V (B_ r+ ,(x.)B_ p+ ,(x.)) 2 = 0(K-'). 

' 1 

1=1 

Hence Y^i=\ V ai ( w iB- p +j(xi)B_ p+ i(xi)) = 0(n/K n ) which yields the assertion. 

□ 
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